In this homework assignment, we will explore, analyze and model a
data set containing approximately 8,000 records, each representing a customer
at an auto insurance company. We will build multiple linear regression
models on the continuous variable TARGET_AMT and binary
logistic regression models on the boolean variable
TARGET_FLAG to predict the probability that a person will
crash their car, and to predict the associated costs.
We are going to build several models using different techniques and variable selection. In order to best assess our predictive models, we will create a validation set within our training data along an 80/20 training/testing proportion, before applying the finalized models to a separate evaluation dataset that does not contain the target.
The insurance training dataset contains 8161 observations of 26 variables; each record represents a customer at an auto insurance company. The evaluation dataset contains 2141 observations of the same 26 variables. These include demographic measures such as age and gender, socioeconomic measures such as education and household income, and vehicle-specific metrics such as car model, age and assessed value.
Each record also has two response variables. The first response
variable, TARGET_FLAG, is a boolean where “1” means that
the person was in a car crash. The second response variable,
TARGET_AMT, is numeric, indicating the (positive) cost if a
car crash occurred; this value is zero if the person did not crash their
car.
We can explore a sample of the training data here, and make some initial observations:
Several character variables contain a z_ prefix in their values that could
be removed for readability. The table below provides valuable descriptive statistics about the training data:
| Name | train_df |
|---|---|
| Number of rows | 8161 |
| Number of columns | 25 |
| Column type frequency: | |
| character | 10 |
| numeric | 15 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| PARENT1 | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| MSTATUS | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| SEX | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| EDUCATION | 0 | 1 | 3 | 12 | 0 | 5 | 0 |
| JOB | 0 | 1 | 6 | 12 | 0 | 9 | 0 |
| CAR_USE | 0 | 1 | 7 | 10 | 0 | 2 | 0 |
| CAR_TYPE | 0 | 1 | 3 | 11 | 0 | 6 | 0 |
| RED_CAR | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| REVOKED | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| URBANICITY | 0 | 1 | 19 | 19 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| TARGET_FLAG | 0 | 1.00 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1.0 |
| TARGET_AMT | 0 | 1.00 | 1504.32 | 4704.03 | 0 | 0 | 0 | 1036 | 107586.1 |
| KIDSDRIV | 0 | 1.00 | 0.17 | 0.51 | 0 | 0 | 0 | 0 | 4.0 |
| AGE | 6 | 1.00 | 44.79 | 8.63 | 16 | 39 | 45 | 51 | 81.0 |
| HOMEKIDS | 0 | 1.00 | 0.72 | 1.12 | 0 | 0 | 0 | 1 | 5.0 |
| YOJ | 454 | 0.94 | 10.50 | 4.09 | 0 | 9 | 11 | 13 | 23.0 |
| INCOME | 445 | 0.95 | 61898.09 | 47572.68 | 0 | 28097 | 54028 | 85986 | 367030.0 |
| HOME_VAL | 464 | 0.94 | 154867.29 | 129123.77 | 0 | 0 | 161160 | 238724 | 885282.0 |
| TRAVTIME | 0 | 1.00 | 33.49 | 15.91 | 5 | 22 | 33 | 44 | 142.0 |
| BLUEBOOK | 0 | 1.00 | 15709.90 | 8419.73 | 1500 | 9280 | 14440 | 20850 | 69740.0 |
| TIF | 0 | 1.00 | 5.35 | 4.15 | 1 | 1 | 4 | 7 | 25.0 |
| OLDCLAIM | 0 | 1.00 | 4037.08 | 8777.14 | 0 | 0 | 0 | 4636 | 57037.0 |
| CLM_FREQ | 0 | 1.00 | 0.80 | 1.16 | 0 | 0 | 0 | 2 | 5.0 |
| MVR_PTS | 0 | 1.00 | 1.70 | 2.15 | 0 | 0 | 1 | 3 | 13.0 |
| CAR_AGE | 510 | 0.94 | 8.33 | 5.70 | -3 | 1 | 8 | 12 | 28.0 |
Based on this summary table and exploration of the data, we can make the following observations:

- Several variables have missing values: YOJ (6%), INCOME (5%), HOME_VAL (6%), CAR_AGE (6%), and AGE (under 1%).
- CAR_AGE has a minimum value of -3, which doesn’t make intuitive sense.

Before building a model, we need to check how the classes are represented in our TARGET_FLAG variable.
Class 1 accounts for 26% and class 0 for 74% of the
target variable. As a result, we have an unbalanced class distribution for
our target variable that we have to deal with, and we may need to take some
additional steps (weighting, resampling, etc.) before or while fitting logistic
regression.
| Value | % |
|---|---|
| 0 | 0.74 |
| 1 | 0.26 |
Many of these distributions seem highly skewed and non-normal. As part of our data preparation we’ll use power transformations to find whether transforming variables to more normal distributions improves our models’ efficacy.
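As a first pass at identifying transformation candidates, we can compute the skewness of each numeric variable. This is a minimal sketch, assuming `train_df` is loaded as above; the `e1071` package (not otherwise used in this report) provides `skewness()`.

```r
# Sketch: flag heavily skewed numeric variables as power-transformation candidates
library(e1071)  # for skewness()

num_cols <- Filter(is.numeric, train_df)
skews    <- sort(sapply(num_cols, skewness, na.rm = TRUE), decreasing = TRUE)

# |skewness| > 1 is a common rule of thumb for "highly skewed"
print(skews[abs(skews) > 1])
```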
Commentary
Interestingly, none of our predictors appear to have strong linear
relationships to our TARGET_AMT response variable, which is
a primary assumption of linear regression. This suggests that
alternative methods might be more successful in modeling the
relationships.
Commentary
In order to work with our training dataset, we’ll need to first convert some variables to more useful data types:

- Currency-formatted character fields to numeric: INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM.
- Categorical character fields to factors: TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION, JOB, MSTATUS, PARENT1, RED_CAR, REVOKED, SEX and URBANICITY.

Before we go further, we need to identify and handle any missing, NA or negative data values so we can perform log transformations and regression.
First, we’ll apply transformations to clean up and align formatting of our variables:

- Remove the INDEX variable.
- Replace JOB blank values with ‘Unknown’.

Next, we’ll manually adjust two special cases of missing or outlier values.
- Where YOJ is zero and INCOME is NA, we’ll set INCOME to zero to avoid imputing new values over legitimate instances of non-employment.
- One record has a CAR_AGE that is less than zero; we’ll assume this is a data collection error and set it to zero (representing a brand-new car).

We’ll use MICE to impute the remaining variables with missing values:
AGE, YOJ, CAR_AGE, INCOME and HOME_VAL.
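A minimal sketch of this imputation step, assuming `train_df` has already had the cleanup above applied; the specific `mice()` settings shown here (method, seed) are illustrative, not necessarily those used for this report.

```r
library(mice)

vars_to_impute <- c("AGE", "YOJ", "CAR_AGE", "INCOME", "HOME_VAL")

imp <- mice(
  train_df[, vars_to_impute],
  m         = 5,      # number of imputed datasets
  method    = "pmm",  # predictive mean matching for numeric variables
  seed      = 123,
  printFlag = FALSE
)

# Fill the missing values using the first completed dataset
train_df[, vars_to_impute] <- complete(imp, 1)
```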
Next we’ll want to consider power transformations for variables
that have skewed distributions. For example, our numeric response
variable TARGET_AMT is a good candidate for transformation,
as its distribution is very highly skewed, and approximate
normality is assumed when applying linear regression.

- We’ll apply log transformations to INCOME, TARGET_AMT and OLDCLAIM to transform their distributions from right-skewed to approximately normal.
- We’ll apply power transformations to BLUEBOOK, TRAVTIME and TIF, so they also are more normally distributed.

To give our models more variables to work with, we’ll engineer some additional features:
- Binned versions of CAR_AGE, HOME_VAL and TIF.
- Binary indicator variables MALE, MARRIED, LIC_REVOKED, CAR_RED, PRIVATE_USE, SINGLE_PARENT and URBAN.

We can examine our final, transformed training dataset and
distributions below (with a temporary numeric variable
CAR_CRASH to represent the response variable for
visualization purposes).
| Name | train_df |
|---|---|
| Number of rows | 8161 |
| Number of columns | 28 |
| Column type frequency: | |
| factor | 10 |
| numeric | 18 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| EDUCATION | 0 | 1 | FALSE | 5 | Hig: 2330, Bac: 2242, Mas: 1658, <Hi: 1203 |
| JOB | 0 | 1 | FALSE | 9 | Blu: 1825, Cle: 1271, Pro: 1117, Man: 988 |
| CAR_TYPE | 0 | 1 | FALSE | 6 | SUV: 2294, Min: 2145, Pic: 1389, Spo: 907 |
| MALE | 0 | 1 | FALSE | 2 | 0: 4375, 1: 3786 |
| MARRIED | 0 | 1 | FALSE | 2 | 1: 4894, 0: 3267 |
| LIC_REVOKED | 0 | 1 | FALSE | 2 | 0: 7161, 1: 1000 |
| CAR_RED | 0 | 1 | FALSE | 2 | 0: 5783, 1: 2378 |
| PRIVATE_USE | 0 | 1 | FALSE | 2 | 1: 5132, 0: 3029 |
| SINGLE_PARENT | 0 | 1 | FALSE | 2 | 0: 7084, 1: 1077 |
| URBAN | 0 | 1 | FALSE | 2 | 1: 6492, 0: 1669 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| TARGET_FLAG | 0 | 1 | 0.26 | 0.44 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
| TARGET_AMT | 0 | 1 | -1.21 | 5.69 | -4.61 | -4.61 | -4.61 | 6.94 | 11.59 |
| KIDSDRIV | 0 | 1 | 0.17 | 0.51 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 |
| AGE | 0 | 1 | 44.78 | 8.63 | 16.00 | 39.00 | 45.00 | 51.00 | 81.00 |
| HOMEKIDS | 0 | 1 | 0.72 | 1.12 | 0.00 | 0.00 | 0.00 | 1.00 | 5.00 |
| YOJ | 0 | 1 | 10.48 | 4.10 | 0.00 | 9.00 | 11.00 | 13.00 | 23.00 |
| INCOME | 0 | 1 | 61461.41 | 47434.08 | 0.00 | 27684.00 | 53483.00 | 85479.00 | 367030.00 |
| HOME_VAL | 0 | 1 | 154644.85 | 129340.41 | 0.00 | 0.00 | 160874.00 | 238349.00 | 885282.00 |
| TRAVTIME | 0 | 1 | 15.08 | 5.82 | 3.00 | 11.17 | 15.34 | 19.12 | 45.61 |
| BLUEBOOK | 0 | 1 | 182.32 | 48.72 | 62.21 | 147.95 | 182.18 | 216.49 | 381.01 |
| TIF | 0 | 1 | 1.63 | 1.25 | 0.00 | 0.00 | 1.62 | 2.43 | 4.70 |
| OLDCLAIM | 0 | 1 | 0.56 | 6.54 | -4.61 | -4.61 | -4.61 | 8.44 | 10.95 |
| CLM_FREQ | 0 | 1 | 0.80 | 1.16 | 0.00 | 0.00 | 0.00 | 2.00 | 5.00 |
| MVR_PTS | 0 | 1 | 1.70 | 2.15 | 0.00 | 0.00 | 1.00 | 3.00 | 13.00 |
| CAR_AGE | 0 | 1 | 8.34 | 5.70 | 0.00 | 1.00 | 8.00 | 12.00 | 28.00 |
| CAR_AGE_BIN | 0 | 1 | 2.48 | 1.12 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 |
| HOME_VAL_BIN | 0 | 1 | 2.45 | 1.16 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 |
| TIF_BIN | 0 | 1 | 2.41 | 1.16 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 |
We can use Mosaic Plots to illustrate the relationship of binary
factor variables to TARGET_FLAG:
We can also use Mosaic Plots to illustrate the relationship of
multi-level factor variables to TARGET_FLAG:
To proceed with modeling, we’ll split our training data into train (80%) and validation (20%) datasets.
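A sketch of the split, assuming `train_df` is the prepared dataset; `createDataPartition()` from caret stratifies on the outcome, so both partitions keep the roughly 74/26 class balance of TARGET_FLAG.

```r
library(caret)

set.seed(123)  # for a reproducible split
train_idx <- createDataPartition(train_df$TARGET_FLAG, p = 0.8, list = FALSE)

train_set <- train_df[train_idx, ]   # 80% for model fitting
valid_set <- train_df[-train_idx, ]  # 20% held out for validation
```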
We’ll use Multiple Linear Regression to model the
TARGET_AMT response variable, the estimated cost of a crash
for a given observation.
The cv.glmnet() function was used to perform k-fold
cross-validation with variable selection using lasso regularization. The
following settings were selected for the model:

- Observation weights to offset the class imbalance in the response: 0.7362 / n for all observations with non-zero values of TARGET_AMT, with a complementary weight for observations where TARGET_AMT is zero.

The resulting model is explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.
The coefficients extracted using lambda.min minimize the mean cross-validated error; the resulting model includes 33 non-zero coefficients and has an AIC of 60.08. The coefficients extracted using lambda.1se produce the most regularized model (cross-validated error within one standard error of the minimum); this model has 25 non-zero coefficients and an AIC of 44.23.

The coefficients extracted using lambda.1se result in the lowest AIC (highest model performance) with fewer predictor variables.
##
## Call: cv.glmnet(x = X, y = Y, weights = weights_lm, type.measure = "mse", nfolds = 10, family = "gaussian", standardize = TRUE, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.02281 48 30.37 0.5402 35
## 1se 0.16092 27 30.85 0.4708 27
## $mse
## lambda.min
## 30.2152
## attr(,"measure")
## [1] "Mean-Squared Error"
##
## $mae
## lambda.min
## 4.756945
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] 65.82961
##
## $BIC
## [1] 302.8764
## $AICc
## [1] 49.90845
##
## $BIC
## [1] 232.8399
Taking a closer look at the remaining non-zero coefficients at the
selected lambda value of lambda.min (0.023), we can observe that the
URBANICITY Highly Urban/ Urban predictor variable has the
largest impact on the response variable TARGET_AMT. In the
lasso model, its coefficient is the biggest contributor to the cost
estimates of a car crash, by roughly a factor of two over the
next-largest coefficient.
Reviewing the top predictor variables that impact the likelihood and cost associated with an accident:

- URBANICITY Highly Urban/ Urban - working or living in an urban neighborhood increases the expected cost associated with a crash
- JOB Doctor - being a doctor reduces the expected costs associated with a crash
- JOB Manager - being a manager reduces the expected costs associated with a crash
- CAR_TYPE Sports Car - owning a sports car increases the expected costs associated with a crash
- CAR_USE Private - using a car for private activities reduces the expected costs associated with a crash
- REVOKED Yes - a history of having a revoked license increases the expected costs associated with a crash

Some of the notable coefficients that drop out of the model include HOME_VAL, JOB Professional, CAR_AGE and CAR_RED.

## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 4.659086e-01
## KIDSDRIV 9.482740e-01
## AGE -6.302475e-03
## HOMEKIDS 7.138829e-02
## YOJ -9.512408e-03
## INCOME -9.310518e-06
## HOME_VAL .
## EDUCATIONBachelors -8.229800e-01
## EDUCATIONHigh School 2.545373e-02
## EDUCATIONMasters -7.044465e-01
## EDUCATIONPhD -2.771949e-01
## JOBClerical 2.184793e-01
## JOBDoctor -2.371073e+00
## JOBHome Maker -5.261066e-02
## JOBLawyer -4.410511e-01
## JOBManager -2.107991e+00
## JOBProfessional .
## JOBStudent -9.671921e-02
## JOBUnknown -5.052147e-01
## TRAVTIME 8.625996e-02
## BLUEBOOK -9.206580e-03
## TIF -2.545276e-01
## CAR_TYPEPanel Truck 5.280531e-01
## CAR_TYPEPickup 7.602791e-01
## CAR_TYPESports Car 1.940375e+00
## CAR_TYPESUV 1.223260e+00
## CAR_TYPEVan 1.044559e+00
## OLDCLAIM 7.704577e-02
## CLM_FREQ 8.117799e-02
## MVR_PTS 2.366235e-01
## CAR_AGE .
## CAR_AGE_BIN .
## HOME_VAL_BIN -2.829123e-01
## TIF_BIN -2.167580e-01
## MALE1 1.505128e-01
## MARRIED1 -1.080955e+00
## LIC_REVOKED1 1.566145e+00
## CAR_RED1 .
## PRIVATE_USE1 -1.794435e+00
## SINGLE_PARENT1 8.514047e-01
## URBAN1 5.129086e+00
As mentioned earlier, the dataset has a high correlation between predictor variables. The lasso regression approaches this issue by selecting the variable with the highest correlation and shrinking the remaining variables (as can be seen in the plot of coefficients).
The lasso model, using coefficients extracted at lambda.1se, was used to predict the held-out test cases, comparing the predicted TARGET_AMT to the actual cost of a car crash. The predicted costs include negative numbers that are effectively zero, so we selected a threshold cost and assigned zero to all amounts below that threshold value.
Since $337.50 was the lowest crash cost included in the training data, we used $100 as the threshold and assume that all predicted costs below 100 dollars are effectively zero.
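The thresholding rule can be sketched as follows, assuming `pred_amt` holds the model's predictions back-transformed to the dollar scale (the variable name is illustrative):

```r
# Treat any predicted cost under $100 as "no crash" (zero cost)
pred_amt <- ifelse(pred_amt < 100, 0, pred_amt)
```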
Using the yardstick package to measure model performance, the
percentage-based metrics (mape, smape and mpe) are distorted by the many
zero and near-zero actual costs, while the mase = 16.5 and
rmse = 756.
## # A tibble: 5 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mape standard 1301.
## 2 smape standard 199.
## 3 mase standard 16.5
## 4 mpe standard -83.5
## 5 rmse standard 756.
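These metrics can be reproduced with a yardstick metric set; a sketch, assuming a data frame `results` holding the measured costs in `truth` and the model predictions in `estimate` (names illustrative):

```r
library(yardstick)

# Bundle the five regression metrics reported above into one callable set
reg_metrics <- metric_set(mape, smape, mase, mpe, rmse)
reg_metrics(results, truth = truth, estimate = estimate)
```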
Analyzing a scatter plot of the prediction errors vs. measured costs and a comparative histogram of the predicted and measured costs highlights a shortcoming of the lasso model: it consistently predicts crash costs lower than the actual measured crash costs. The gap is more pronounced when comparing the distributions of predicted and measured costs.
To reduce multicollinearity we can use regularization, which keeps all the features while shrinking the magnitudes of the model coefficients. This is a good solution when each predictor contributes to predicting the dependent variable.
The Standardized Residuals plot shows increasing variance at higher values of the response variable.
The lasso regression solves the multicollinearity issue by selecting the variable with the largest coefficient while setting the rest to (nearly) zero.
| term | estimate | p.value |
|---|---|---|
| URBAN1 | 3.741 | 0.000 |
| LIC_REVOKED1 | 1.719 | 0.000 |
| MVR_PTS | 0.256 | 0.000 |
| PRIVATE_USE1 | -1.516 | 0.000 |
| TIF | -0.366 | 0.000 |
| CAR_TYPESports Car | 1.649 | 0.000 |
| KIDSDRIV | 0.869 | 0.000 |
| JOBManager | -1.831 | 0.000 |
| TRAVTIME | 0.069 | 0.000 |
| CAR_TYPESUV | 0.991 | 0.000 |
| BLUEBOOK | -0.009 | 0.000 |
| OLDCLAIM | 0.060 | 0.000 |
| SINGLE_PARENT1 | 1.137 | 0.000 |
| MARRIED1 | -0.830 | 0.000 |
| CAR_TYPEPickup | 0.789 | 0.000 |
| CAR_TYPEVan | 0.947 | 0.000 |
| INCOME | 0.000 | 0.001 |
| JOBDoctor | -1.779 | 0.001 |
| HOME_VAL_BIN | -0.575 | 0.001 |
| EDUCATIONBachelors | -0.702 | 0.003 |
| JOBUnknown | -0.999 | 0.012 |
| HOME_VAL | 0.000 | 0.034 |
| CAR_TYPEPanel Truck | 0.621 | 0.051 |
| JOBLawyer | -0.653 | 0.088 |
| (Intercept) | -0.880 | 0.109 |
| EDUCATIONMasters | -0.518 | 0.123 |
| YOJ | -0.027 | 0.133 |
| JOBStudent | -0.381 | 0.195 |
| JOBProfessional | -0.278 | 0.289 |
| JOBHome Maker | -0.328 | 0.323 |
| EDUCATIONPhD | -0.304 | 0.457 |
| EDUCATIONHigh School | 0.065 | 0.757 |
| JOBClerical | 0.030 | 0.900 |
The resulting model is much more parsimonious than the first, with statistically significant results for three predictors: bluebook, mvr_pts and mstatus_yes.

The Adjusted R-squared is better than Model 1’s but still very low (0.0167), meaning this model explains only about 1.7% of the total variance in the response variable target_amt. However, an examination of the residuals indicates most of the key assumptions for linear regression are met: the Residuals vs. Fitted plot shows more constant variability of the residuals, and the Q-Q plot indicates a greater level of normality.

The summary table includes the estimates transformed back to the original scale for easier interpretation. In this case, the ‘base’ target_amt would be estimated at $3,434.76, with a 1% increase per each dollar of bluebook value, a 1.07% increase if the driver were male, and a 0.9% decrease if the driver were married.
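Because the response was modeled on the log scale, each coefficient translates back to a multiplicative (percentage) effect on the dollar scale via exponentiation. A sketch, assuming `fit` is the fitted lm object (coefficient names are illustrative):

```r
est <- coef(fit)

exp(est["(Intercept)"])            # baseline TARGET_AMT in dollars
(exp(est["BLUEBOOK"]) - 1) * 100   # % change in cost per unit of BLUEBOOK
(exp(est["MALE1"]) - 1) * 100      # % change in cost if the driver is male
```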
We’ll use Binary Logistic Regression to classify our response
variable TARGET_FLAG, the probability of a car crash for a
given observation.
Lasso Regression may be a good candidate for this dataset, since we are dealing with a large number of complex variables. Lasso helps identify the most important variables and reduces the model complexity.
The cv.glmnet() function was also used to fit the logistic regression model. As with the linear regression model, k-fold cross-validation was performed with variable selection using lasso regularization. The following settings were selected for the model:

- Observation weights to offset the class imbalance in the response: 0.7362 / n for observations with a 1 value of TARGET_FLAG, with a complementary weight for observations with a 0 value.

The resulting model is explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.
The coefficients extracted using lambda.min result in the lowest AIC and the highest-performing model.
##
## Call: cv.glmnet(x = X, y = Y, nfolds = 5, family = "binomial", link = "logit", standardize = TRUE, alpha = 1)
##
## Measure: Binomial Deviance
##
## Lambda Index Measure SE Nonzero
## min 0.001364 48 0.9064 0.006829 35
## 1se 0.006043 32 0.9129 0.005470 28
## $deviance
## lambda.min
## 0.8952216
## attr(,"measure")
## [1] "Binomial Deviance"
##
## $class
## lambda.min
## 0.2052696
## attr(,"measure")
## [1] "Misclassification Error"
##
## $auc
## [1] 0.8128385
## attr(,"measure")
## [1] "AUC"
##
## $mse
## lambda.min
## 0.2899812
## attr(,"measure")
## [1] "Mean-Squared Error"
##
## $mae
## lambda.min
## 0.5863108
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -1618.664
##
## $BIC
## [1] -1381.617
## $AICc
## [1] -1576.707
##
## $BIC
## [1] -1387.009
Taking a closer look at the remaining non-zero coefficients at the
selected lambda.min value, we can observe that the
URBANICITY Highly Urban/ Urban predictor variable has the
largest impact on the predicted probability of a car crash, by roughly a
factor of three. Reviewing the top 5 predictor variables that impact the
likelihood of an accident:

- URBANICITY Highly Urban/ Urban - working or living in an urban neighborhood increases the likelihood of a crash
- CAR_USE Private - using a car for private activities reduces the likelihood of a crash
- JOB Manager - being a manager reduces the likelihood of a crash
- REVOKED Yes - a history of having a revoked license increases the likelihood of a crash
- JOB Doctor - being a doctor reduces the likelihood of a crash

Only a handful of coefficients drop out of the model (HOME_VAL, JOB Home Maker, CAR_AGE and CAR_RED), while several variables including INCOME and OLDCLAIM are shrunk substantially.
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.503148e+00
## KIDSDRIV 4.104985e-01
## AGE -2.985770e-03
## HOMEKIDS 1.467625e-02
## YOJ -5.603125e-03
## INCOME -3.789459e-06
## HOME_VAL .
## EDUCATIONBachelors -3.280895e-01
## EDUCATIONHigh School 1.985406e-02
## EDUCATIONMasters -3.554292e-01
## EDUCATIONPhD -1.740929e-01
## JOBClerical 9.929721e-02
## JOBDoctor -7.568191e-01
## JOBHome Maker .
## JOBLawyer -5.063537e-02
## JOBManager -7.474078e-01
## JOBProfessional -2.970786e-04
## JOBStudent -6.706470e-02
## JOBUnknown -1.378131e-01
## TRAVTIME 3.655349e-02
## BLUEBOOK -4.750919e-03
## TIF -6.142600e-02
## CAR_TYPEPanel Truck 3.134704e-01
## CAR_TYPEPickup 3.633557e-01
## CAR_TYPESports Car 8.355260e-01
## CAR_TYPESUV 4.954207e-01
## CAR_TYPEVan 4.529227e-01
## OLDCLAIM 2.156472e-02
## CLM_FREQ 5.234470e-02
## MVR_PTS 1.014769e-01
## CAR_AGE .
## CAR_AGE_BIN .
## HOME_VAL_BIN -1.209622e-01
## TIF_BIN -1.325964e-01
## MALE1 2.502728e-02
## MARRIED1 -4.244342e-01
## LIC_REVOKED1 7.337489e-01
## CAR_RED1 .
## PRIVATE_USE1 -7.770794e-01
## SINGLE_PARENT1 4.101707e-01
## URBAN1 2.218263e+00
The coefficients extracted at the lambda.min value are used to predict the likelihood of an accident. The confusion matrix below highlights an accuracy of 78.0%.
## True
## Predicted 0 1 Total
## 0 1108 266 1374
## 1 94 165 259
## Total 1202 431 1633
##
## Percent Correct: 0.7795
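The accuracy, precision, sensitivity and F1 reported for this model follow directly from the confusion matrix above (rows are predicted classes, columns are true classes):

```r
# Cell counts taken from the confusion matrix printed above
TP <- 165; FP <- 94; FN <- 266; TN <- 1108

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # 1273 / 1633 = 0.7795
precision <- TP / (TP + FP)                   # 0.6371
recall    <- TP / (TP + FN)                   # 0.3828 (sensitivity)
f1        <- 2 * precision * recall / (precision + recall)  # 0.4783
```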
Again we check the linear relationship between the independent variables and the logit of the target variable. Visually inspecting the results, there is a linear trend in the relationship, but there are deviations from the straight line for all variables. The lasso regression addresses the multicollinearity issue by selecting the variable with the largest coefficient while setting the rest to (nearly) zero.
**Model 3.1: Lasso Linear Regression**

**Model 4.1: Lasso Logistic Regression**
## F1
## 0.4782609
Regression
| Model | mape | smape | mase | mpe | rmse | AIC |
|---|---|---|---|---|---|---|
| M3.1: Lasso Linear | 1301.105 | 199.0998 | 16.4806 | -83.4733 | 756.1637 | 65.8296 |
Logistic
| Model | Accuracy | Classification error rate | F1 | Deviance | R2 | Sensitivity | Specificity | Precision | AIC |
|---|---|---|---|---|---|---|---|---|---|
| M4.1: Lasso Logistic | 0.8132 | 0.1868 | 0.4783 | 0.8978 | NA | 0.3828 | 0.9218 | 0.6371 | -1618.664 |